

Poster: Tyler (Administrator, Curator, or Staff) Date: Mar 19, 2011 11:44am
Forum: etree Subject: Re: a million songs yet?

"hi tyler,

no we don't use databases here a lot and so the #songs is not stored in any of our DB tables
and not in our search engine.

so they'd need to crawl us unfortunately.

the nicest way to get this information from us would be to get a list of all identifiers in LMA,
and then to crawl the _files.xml with a loop over identifiers like:
http://www.archive.org/download/IDENTIFIER/IDENTIFIER_files.xml

and then prolly count the "original" files that appear to be audio files based on format or
suffix (or both)

i'd be happy to put the #songs stat up on LMA if they get a number for us 8-)
--tracey"
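The counting step tracey describes could be sketched roughly as below. This is only an illustration, not anyone's actual script: it assumes each _files.xml holds `file` elements with `name` and `source` attributes and a `format` child, and the suffix and format lists are guesses that would need tuning against real items.

```python
import xml.etree.ElementTree as ET

# Assumed audio suffixes and format-name hints; both lists are illustrative
# guesses, not an official list from the Archive.
AUDIO_SUFFIXES = (".flac", ".shn", ".wav", ".mp3", ".ogg")
AUDIO_FORMAT_HINTS = ("flac", "shorten", "wave", "mp3", "ogg")


def count_original_audio(files_xml):
    """Count files marked source="original" that look like audio.

    files_xml is the text (or bytes) of one IDENTIFIER_files.xml document.
    A file counts if its suffix OR its format name looks like audio.
    """
    root = ET.fromstring(files_xml)
    count = 0
    for f in root.iter("file"):
        # Skip derivatives (e.g. MP3s generated from the uploaded FLACs).
        if f.get("source") != "original":
            continue
        name = (f.get("name") or "").lower()
        fmt = (f.findtext("format") or "").lower()
        if name.endswith(AUDIO_SUFFIXES) or any(h in fmt for h in AUDIO_FORMAT_HINTS):
            count += 1
    return count
```

An item whose uploader included the same tracks in two original formats would be double-counted by this sketch, so the totals it produces are an approximation.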


Poster: xtifr Date: Mar 19, 2011 4:47pm
Forum: etree Subject: Re: a million songs yet?

Ok, that sounds like a good approach. It'll still require about 90k file downloads plus all the accesses required to figure out where the files are, but at least they'll be small files. I'll start with some small-scale experimentation for proof of concept, and set it up so that I can throttle it back if necessary when I do a full run.


Poster: brewster (Administrator, Curator, or Staff) Date: Mar 20, 2011 11:35am
Forum: etree Subject: Script to download files from lots of items on the Archive

Hank Bromley and the Collections group have gotten good at using wget to download files from lots of items on the Archive. I have included our internal documentation file on how to do it. Wish it were a web form, but we just have never gotten to it (any volunteers?).

This should be helpful to figure out how to download all the files.xml files, for instance.

I have also included a list of all public etree items, including the Grateful Dead concerts. It comes to over 97,000!

-brewster


Attachment: etree-identifiers.csv.zip
Attachment: WgetBulkDownloadFromArchive.zip


Poster: xtifr Date: Mar 20, 2011 11:53am
Forum: etree Subject: Re: Script to download files from lots of items on the Archive

Thanks. I'm already pretty solid with wget (and curl), and in any case, for this task, I plan to use python's built-in urllib instead of spawning thousands of subprocesses. The list of identifiers will be very handy, though.
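For illustration, fetching one item's _files.xml in-process might look something like the sketch below. It uses Python 3's urllib.request (the 2011-era equivalent would have been Python 2's urllib), follows the URL pattern given earlier in the thread, and the function names and the timeout value are made up for this example.

```python
import urllib.parse
import urllib.request


def files_xml_url(identifier):
    """Build the _files.xml URL for one item, per the pattern in the thread."""
    # Quote defensively; typical identifiers (letters, digits, '.', '-', '_')
    # pass through unchanged.
    ident = urllib.parse.quote(identifier)
    return "http://www.archive.org/download/{0}/{0}_files.xml".format(ident)


def fetch_files_xml(identifier, timeout=30):
    """Download one _files.xml and return its raw bytes.

    The 30-second timeout is an arbitrary choice for this sketch.
    """
    with urllib.request.urlopen(files_xml_url(identifier), timeout=timeout) as resp:
        return resp.read()
```

Calling `files_xml_url` on each identifier from the attached list gives the full set of URLs to visit, with no subprocess spawned per item.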

(I do know how to get identifiers from the advanced search API too, but since someone's already done the work, I have no problems taking advantage of it.) :)


Poster: xtifr Date: Mar 23, 2011 1:17pm
Forum: etree Subject: Counting has started!

I've started running the script. Early results suggest that LMA shows average about 15 songs each, which, with roughly 97,000 shows, implies that the total will be well over a million!

It's going to take a while to get the final results because A) I've got it throttled to pause for a second between shows (97,000 shows / 3,600 secs-per-hour ≈ 27 hours of idleness alone) and B) my Internet connection gets flaky in the rain. But I'll post the grand total as soon as it's available.
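A throttled loop of the kind described here could be sketched as follows. This is not the actual script: the `fetch` and `count` callables, the retry count, and the back-off rule are all assumptions added for illustration, with retries standing in as one way to cope with a flaky connection.

```python
import time


def crawl(identifiers, fetch, count, delay=1.0, retries=3, sleep=time.sleep):
    """Visit each identifier, pausing `delay` seconds between shows.

    fetch(identifier) returns the item's _files.xml; count(xml) returns the
    number of songs in it. Only network-style errors (OSError, which covers
    urllib's URLError) are retried. Returns (total_songs, failed_identifiers).
    """
    total = 0
    failed = []
    for ident in identifiers:
        for attempt in range(retries):
            try:
                total += count(fetch(ident))
                break
            except OSError:
                # Back off a little longer after each failed attempt.
                sleep(delay * (attempt + 1))
        else:
            # Every attempt failed; remember the item for a later re-run.
            failed.append(ident)
        sleep(delay)  # the throttle: be polite to the server
    return total, failed


def idle_hours(n_shows, delay=1.0):
    """Hours spent sleeping between shows, ignoring download time."""
    return n_shows * delay / 3600.0
```

Keeping the list of failed identifiers means a dropped connection costs a short re-run over the failures, not a restart of the whole crawl.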

I don't expect it to, but if my hammering becomes a problem, please let me know at my gmail account, "xtifr.w". Thanks. (I do need to update my archive account one of these days.)


Poster: Tyler (Administrator, Curator, or Staff) Date: Mar 25, 2011 7:14pm
Forum: etree Subject: Re: Counting has started!

this is awesome news! I can't wait to see what it ends up being. crawl it away!